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ABSTRACT 

This  paper  studies  a  set  of  2-D  basic  algorithms  for  SIMD  perfect  shuffle 
networks.  Those  algorithms  where  studied  in  [1]  and  [SN],  but  for  the  1-D 
case,  where  the  size  of  the  problem  N  is  the  same  as  the  number  of  proces- 
sors P .  For  the  2-D  case  of  N  =  L*P ,  we  improve  some  algorithms  to  be  "2- 
D  efficient",  by  improving  their  running  time  from  0(L*logP)  to 
0(L  +  logP),  as  N  exceeds  P.  We  give  nontrivial  algorithms  for  the  follow- 
ing operations:  Row-Reduction,  Parallel-Prefix,  Transpose,  Packing, 
Smoothing  and  Cartesian-Product. 

1.   Introduction  and  the  Model 

Schwartz  [1]  introduced  the  idea  of  a  parallel  processor  based  on  the  shuffle  connections 
[2];  reviewed  various  basic  of  SIMD  algorithms  for  such  an  ensemble  of  processors,  and 
analyzed  their  asymptotic  time  complexities.  However,  these  algorithms  are  designed  for 
the  1-D  case,  where  N,  the  size  of  the  problem  is  restricted  to  be  P,  the  number  of  proces- 
sors. In  reallity,  the  general  case  where  A'^>P  is  much  more  likely  to  appear.  In  this  case 
each  processor  stores  L  =  NIP  data  elements  in  its  local  memory.  This  data  structure  is 
referred  to  as  a  2-D  array  of  P  columns  and  L  rows,  where  the  /'th  column  corresponds  to 
the  local  memory  of  the  processor  PEf . 

Our  algorithms  correspond  to  a  SIMD  Perfect  Shuffle  connected  machine  (PS)  [1].  For 
simplicity,  although  it  is  not  crucial,  we  assume  that  P  is  a  power  of  two.  The  processors 
are  ordered  left  to  right  and  are  numbered  0  •  •  •  P  — 1  accordingly,  so  we  denote  the  y'th 
processor  PE;.  Subgroups  of  processors  are  referred  to  in  a  "natural"  manner:  left  half, 
right  half,  odd,  even  etc.  There  are  three  kinds  of  communication  steps:  shuffle  (ct), 
unshuffle  (cr~-^)  and  exchange  {EX)  and  a  processor  internal  (fixed-size)  computation.  All 
of  these  operations  take  one  time  step. 

In  any  parallel  network  machine  we  have  two  possible  working  modes: 
One  in  which  each  processor  computes  a  local  result  of  a  sequential  computation  using  its 
local  data.  Then  an  inter-processor  global  algorithm  is  carried  out  to  compute  the  final 
result.  In  other  words,  this  mode  involves  separate  sequential  work  by  PE's  followed  by  a 
global  parallel  step.  It  is  associated  with  column  order  of  the  array:  first  each  column  is 
processed,  only  then  a  "row"  of  partial  results  is  being  processed. 

In  the  other  mode  each  processor  chooses  an  element  from  its  local  memory.  Then  an 
inter-processor  algorithm  computes  temporal  results  that  are  distributed  back  among  the  pro- 
cessors. This  process  repeats  until  all  elements  in  the  local  memory  of  every  processor  have 
been  processed.  This  mode  corresponds  to  interleaved  processing  phases.  It  is  associated 
with  row  order  of  the  array:  in  each  phase  the  next  row  of  the  array  is  being  processed. 

We  attempt  to  develop  a  set  of  2-D  basic  and  2-D  efficient  algorithms.  In  2-D  Basic  we 
mean  algorithms  which  enable  us  to  manipulate  the  elements  of  the  2-D  array,  such  that  it 
will  fit  our  working  modes.    For  example,  this  include  all  sort  of  row-column  relations. 


These  algorithms,  in  turn,  can  be  used  as  building  blocks  of  more  sophisticated  algorithms 
for  handling  2-D  arrays  on  shuffle  networks. 

We  now  describe  what  2-D  efficiency  means: 

Assume  a  1-D  parallel  algorithm  {N  =  P)  which  requires  O(logiV)  steps,  e.g.  the 
parallel-prefix  (PP)  algorithm  [6].  The  naive  way  of  expanding  any  such  algorithm  to  2-D 
(N  =  L*P)  case,  is  to  simulate  an  N  processor  network  by  the  P  processor  network  that  we 
have.  Since  in  each  processor  we  have  L  elements,  such  a  simulation  needs 
OiL*\ogN)  =  0(L*logL  +  L*logP)  steps.  However  in  [3]  we  find  a  simple  method  of  per- 
forming 2-D  PP  in  0{L*]ogP)  steps.  This  is  achieved  by  performing  1-D  PP  for  each  row, 
starting  with  the  sum  of  the  previous  rows.  Thus  in  the  2-D  case  we  can  expect  to  do  better 
then  the  simulation  of  the  larger  sized  1-D  case.  We  call  an  algorithm  2-D  efficient,  if  it 
performs  better  than  the  naive  simulation  of  the  best-known  same-size  1-D  algorithm.  We 
attempt  to  replace  the  '*\ogP'  term  with  a  '  +  logP'  term  whenever  this  is  possible,  getting 
0(L  +  \ogP)  overall  runtime. 

For  example  consider  the  PP  by  columns:  First  each  processor  computes  the  PP  of  its 
column  elements,  then  apply  1-D  PP  to  the  sums  of  each  column.  Now  each  processor  has 
the  sum  of  the  elements  in  the  previous  columns.  Finly  each  processor  adds  this  sum  to  its 
elements  and  achieves  the  PP  by  columns  in  0(L  +  logP). 

The  following  principles  help  to  achieve  2-D  efficient  algorithms: 

1.  Locality 

As  demonstrated  by  the  above  PP  by  columns  algorithm,  sequential  calculations  at  the 
local  environment  of  a  processor  are  optimal,  i.e.  they  have  0(N/P)  speedup. 

2.  Optimal  use  of  processors 

Using  a  1-D  algorithm  which  calculates  results  of  some  binary  operation  on  the  1-D 
array,  the  number  of  inactive  processors  double  every  cycle.  In  2-D  algorithms  these 
processors  can  be  used  to  process  the  next  row  elements. 

3.  Limited  distribution 

When  distributing  data  elements  to  N  iN>P)  virtual  processors,  it  is  sufficient  to  distri- 
bute them  among  the  P  available  processors  only. 

4.  Virtual  simulation 

Moving  a  bunch  of  elements  from  one  place  to  another  during  the  run  of  an  algorithm 
is  not  always  necessary,  as  other  elements  move  back  later.  This  waste  can  be  elim- 
inated when  simulating  the  run  first,  moving  the  number  of  elements  in  the  bunch 
rather  than  the  bunch  itself. 
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2.  Sort 

In  the  recent  years  there  has  been  an  enormous  amount  of  research  on  parallel  sorting. 
For  an  overview  of  results  see  [10]  and  [11].  For  PS  networks,  two  algorithms  are  relevant 
to  this  work: 

For  LxP  2-D  array,  if  L<P  it  is  best  to  use  the  Oilog'^P)  bitonic  sort  [18],  and  use  the 
merge  and  split  of  Baudet  &  Stevenson  [12]  to  extend  it  to  the  2-D  case.  This  sorting  ter- 
minates after  0(L*log^P+L*logL)  steps.  Another  possibility  is  to  use  the  radix-exchange 
sort  [20]  which  consists  of  21ogP  packing  operations.  Using  the  2-D  packing  described  in 
section  5,  this  algorithm  is  a  8*L*log^P  +  o(L*logP)  steps  2-D  sort.  When  L>P  the  algo- 
rithm of  Cypher  and  Sanz  [9]  which  optimally  sorts  L*P  elements  in  0(L*logP)  steps  is 
better.  Since  there  are  two  cases,  the  value  of  L  must  be  available.  When  L  is  not  known,  it 
is  easily  found  in  O(L-l-logP)  steps,  using  maximum  operation. 

3.  The  Raw-Reduction  Operation 

Assume  that  the  data  for  our  PS-machine  are  in  Lxp  array.  The  2-D  reduction  opera- 
tion calculates  the  result  of  some  binary  operation  bop  (such  as  +,  -,  min  ...  )  for  each  row. 
The  results  are  stored  in  the  first  processor,  thus  we  refer  to  this  operation  as  "all-rows  to 
one-column"  operation. 

The  naive  way  to  achieve  2-D  reduction  is  to  use  PP  on  every  row,  which  requires 
0(L*logP)  steps.  In  order  to  get  a  2-D  efficient  algorithm  we  apply  the  the  "optimal  use  of 
processors"  principle  (described  in  the  introduction). 

The  algorithm  advances  in  L  +  logP  —  1  rounds.  The  rounds  are  equivalent  except  for 
the  following: 

•  The  first  round  is  a  short  one  and  consists  of  steps  (3)  and  (4)  only,  to  be  described 
next. 

•  At  the  rth  round,  /=1    •  •  •    logP,  processors  0    •  •  •    2'°^^  ~  '  —  1  are  inactive. 

•  At  the  /'th  round,  L  <  i  <  L  +  logP ,  processors  P  -  2'~^  +  1    •  ■  ■    P  -  1  are  inactive. 
The  /'th  round  itself  consists  of  four  steps: 

(1)  Every  even  processor  sends  the  step  (2)  result  to  its  ct~^  neighbor. 

(2)  Every  processor  which  got  new  value  at  step  (3),  computes  bop  for  this  value  with  its 
own  value  of  the  /'th  row.  For  the  next  round  the  result  is  its  new  value  of  the  /'th 
row. 

(3)  For  every  odd  processor,  PE;',  PE;  moves  the  value  of  the  /'th  row  to  PE;-  1  using  the 
EX  connection.  Note  that  for  every  individual  processor  the  round  number  (/)  is  the 
number  of  rounds  it  is  active. 

(4)  For  every  even  processor  compute  bop  for  its  /'th  row  value,  and  the  value  it  received 
at  step  (1). 
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At  each  round  a  new  row  becomes  "active",  and  its  remaining  number  of  elements 
starts  to  decrease.  Consider  a  row,  k,  which  is  currently  being  processed.  Note  that  the  set 
of  processors  still  containing  elements  of  k  is  consecutive  and  starts  from  PEO.  Let  Ki  (Kr) 
denote  the  left  (right)  half  of  this  set.  At  each  round  steps  (3)  and  (4),  Kr  is  decreased  by 
half,  with  the  remaining  half  stored  at  the  even  processors.  At  steps  (1)  and  (2)  (of  the  fol- 
lowing round)  this  half  is  folded  on  and  added  to  the  right  half  of  AT;,  so  Kr  is  "free"  to 
proceed  with  the  next  line. 

This  pipeline  process  is  demonstrated  in  Fig.  1,  the  numbers  denote  the  row  number  of 
which  an  element  is  currently  processed.  The  table  is  for  sixteen  processors,  L  =  5.  Note  the 
results  accumulating  in  PEO,  starting  with  the  first  row  at  step  four. 

The  reduction  operation  require  4(L  +  \ogP)  steps.  The  reverse  sequence  of  steps  can 
be  used  for  broadcasting  a  column  stored  at  PEO  to  the  "rest  of  the  world".  This  is  a  2-D 
column  broadcast  operation  which  is  useful  for  many  algorithms. 

The  reverse  operation  is  handy  not  only  for  duplication  but  also  for  generating  dif- 
ferent values.  Note  that  2-D  reduction  spans  a  binary  tree  for  each  row.  By  reversing  its 
steps  we  actually  traversing  the  tree  topdown.  We  name  rop  some  function 
ropix,y)  — -  <l,r>  of  a  local  value  x  and  an  inherited  value  y  to  create  two  different  values 
l,r,  for  the  left  and  the  right  sons. 

3.1.   2-D  Multiple-Broadcast  operation 

Let  each  processor  have  a  column  of  L  elements  in  a  2-D  array.  The  multiple-broadcast 
operation  is  defined  as  the  distribution  all  columns  to  all  processors.  Using  the  reverse  2-D 
reduction  we  get  a  2-D  multiple  broadcast  algorithm  in  0(L*P)  steps: 
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Fig.  1.   2-D  Row-Reduction  of  16-Processors. 
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•  In  L*(P-  1)  steps  all  elements  arrive  at  PEO. 

•  Using  reverse  2-D  reduction  PEO  broadcasts  all  L*P  elements  to  the  rest  of  the  world  in 
4(L*P  +  logP)  steps. 

4.   Parallel  Prefix  by  Raws 

In  the  introduction  it  is  shown  that  PP  by  columns  take  OCL  +  logP)  steps  where  the 
best  known  algorithm  for  PP  by  rows  requires  0(L*logP)  steps  [6].  Here  this  gap  is  closed 
by  an  PP  by  rows  algorithm  which  requires  0(L  +  logP)  steps  only. 

The  algorithm  uses  a  variant  of  the  2-D  reduction  (see  section  3).  Observe  that  the 
reduction  algorithm  virtually  constructs  a  binary  tree  above  the  right  half  of  the  row  of  pro- 
cessors. Skipping  step  (4)  of  the  reduction  iterations,  i.e.  avoiding  adding  the  local  values  of 
the  internal  nodes  of  that  tree,  we  get  a  binary  tree  containing  all  partial  sums  of  the  right 
half.  For  the  z'th  row  we  denote  that  tree  as  Tf.  We  denote  by  Tf  the  tree  obtained  by  exe- 
cuting the  symmetric  algorithm  for  the  left  half,  note  that  the  root  value  is  stored  at 
PE(P  — 1)  rather  than  PEO.  It  is  important  that  both  trees  preserve  the  processor  order  in 
the  following  sense:  if  an  internal  node  contains  the  sum  of  the  values  of  PE/i,  •  •  •  ,  PEm 
then  its  left  son  contains  the  corresponding  sum  of  PE/j,  •  •  ■  ,  PE((«+wi  — l)/2)  and  the 
right  son  contains  that  sum  for  PE((n+m  +  l)/2),    •  •  ■  ,  PEm. 

The  algorithm  is  based  on  the  following  observations: 

(i)  If  we  have  such  an  ordered  binary  tree  for  the  complete  row,  we  could  easily  get  1-D 
PP  for  the  /'th  row.  This  is  done  by  traversing  the  tree  top  down,  where  each  father 
hands  its  inherited  value  to  its  left  son  and  its  inherited  value  plus  its  left  son's  value  to 
its  right  son. 

(ii)  The  same  idea  of  (i)  is  correct  when  the  two  halves  of  the  tree  are  available  separately, 
and  the  root  of  the  right  half  gets  the  value  of  the  root  of  the  left. 

(iii)  Finally,  note  that  given  the  sum  of  the  lower  indexed  rows,  the  /'th  row  2-D  PP  can  be 
achieved  by  the  same  idea  of  (i)  and  (ii). 

The  algorithm  advances  as  follows: 

(1)  Use  the  variant  of  2-D  reduction  to  compute  7f  for  all  /  in  4(L  +  IogP).  Repeat  the 
same  computation  for  Tf.  All  processors  save  up  to  2L  partial  sums  to  use  at  step  (5) 
of  the  algorithm  below. 

(2)  In  L  +  21ogP  steps  move  the  column  of  step-(l)'s  results,  stored  in  PE(P-l),  to  PEO. 

(3)  PEO  computes  the  total  sum  of  each  row  and  PP  for  the  column  of  sums.  This  takes  2L 
steps. 

(4)  In  L  +  21ogP  steps  move  the  resulted  column  at  step  (3)  to  PE(P-l). 

(5)  Reverse  the  tree  constructions  made  at  step  (1),  computing  2-D  PP  for  each  row,  as 
described  in  the  above  observations. 
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5.  The  Transpose  Operation 

The  transpose  operation  transposes  the  elements  of  a  2-D  array  Ly-P  from  row  order 
into  column  order  (Trc)  or  from  column  order  into  row  order  (7^^).  Suppose  we  have  L*P 
array  elements  numbered  1  ,  •  •  ■  ,  L*P .  Row  order  means  that  the  numbering  of  elements 
increases  first  by  columns  and  then  by  rows,  i.e.  PE/  contains  the  following  elements 
{  /  +  p*j  +  \}j=Q  .  .  .  L-\.    In  column  order  PEz  contains  {/*L  +  1  ,    •  •  •    ,  i*L  +  L). 

Note  that  since  our  data  type  (2-D  array)  reflects  the  hardware  structure,  the  result  of 
the  transpose  operation  is  not  the  matrix  transpose  operation.  Repeating  the  same  transpose 
twice  does  not  yield  the  original  array,  so  Trc  "^  T^cr-    An  example  is  shown  in  fig.  2. 

In  [1]  there  is  a  brief  description  of  the  transpose  operation  for  P"*  x  P^.  We  present  a 
general  algorithm  for  the  2-D  case  in  0(L*logP)  steps.  To  see  that  this  is  asymptotically 
optimal  refer  to  section  6.4,  the  packing  lower  bound  described  there  holds  here  too.  When 
L^P  we  could  use  the  sorting  operation  as  described  before  and  get  the  same  order  of  time 
for  the  transpose  operation.  However,  our  algorithm  holds  for  any  L  and  is  of  course,  far 
better  in  terms  of  constants. 

5.1.   Transpose  Rows  to  Columns  T^c 

We  denote  the  location  of  an  element  by  [/,;]  /  =  0,  •  ■  ■  ,L-1  ,  ;'  =  0,...,P- 1,  where;  is 
the  processor  location  and  /  is  its  local  memory  location.    The  new  location  is  [/',/]  where  : 

/'  =  /*P  +  ;■  -  (;■'  -  1)*L  (1) 

Assume  for  simplicity  that  L  is  a  power  of  two.  Let  <i,j>  denote  the  binary  representation 
of  i*P+j.  >From  (1)  ;"  is  equal  to  </,;>  shifted  right  logL  bits.  Thus  ;"  is  equal  to  the 
logP  most  significant  bits  of  <i,j>. 

Moving  from  ;  to  ;"  is  done  by  logP  iterations  of  updating  the  least  significant  bit  of; 
(EX  step),  followed  by  a  ct"1  step  to  cyclicly  shift  the  bits  to  the  right.  Moving  elements 
between  columns,  without  careful  selection  may  cause  collisions  and  congestion  while 
correcting  the  bits. 


123 


4^16 


7|8^ 


atb^ 


transpose 


1^^ 


2|6^ 


317^ 


4|8b 


Fig.  2.    2-D  Trc  operation  for  4x3  array. 
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Partition  the  L  elements  of  each  processor  into  P  groups  of  L/P  elements  each,  such 
that  the  elements  of  each  group  have  the  same  leftmost  logP  bits  of  their  /  index.  Thus  all 
the  elements  in  a  group  have  the  same  destination.  The  groups  are  ordered  by  their  desti- 
nation bits  starting  with  ;  (the  current  processor):  j  ,  j  +  limod  P)  ,  ■  •  •  ,  j  +  P  —  l{mod  P). 
Figure  3  describes  this  ordering  for  two  processors,  7  =  000  and  7=101  (out  of  eight). 

The  elements  in  the  first  group  are  destinated  for  the  current  processor,  therefore  only 
local  transformations  are  required.  For  the  second  group  in  each  processor  we  need  to 
correct  one  bit,  so  one  step  is  sufficient.  In  general,  for  the  f  th  group  we  need  to  correct  up 
to  [log  i\  bits,  so  2*  [log  i\  —  \  steps  are  required.  In  every  even  step  ct  connection  is  used 
and  in  every  odd  step  either  EX  connection  is  used  or  the  element  stays  in  place.  We  say 
that  collision  occurs  when  in  some  step  some  processors  use  EX  and  others  do  not.  That  is, 
when  collision  do  not  occur  during  some  odd  step,  the  number  of  elements  in  all  processors 
does  not  change. 

The  algorithm  advances  in  phases.  In  a  single  phase  every  processor  send  one  element 
of  the  same  group  /  to  its  destination.  The  following  phase  starts  when  all  of  these  elements 
arrive.  The  key  observation  is  that  for  all  of  these  elements  we  need  to  correct  address  bits 
at  the  same  locations.  We  conclude  that  there  can  be  no  collisions  during  a  phase,  since  all 
elements  originate  at  different  processors  and  since  in  odd  steps  either  all  of  them  use  EX  to 
complement  the  right  bit  or  none  of  them  does. 

The  complete  algorithm  sends  elements  by  the  group  order.  The  /'th  group  completes 
after  L/P  phases,  each  phase  takes  at  most  2*[logzl  — 1  steps.  Note  that  an  "odd"  step  in 
which  no  EX  connection  is  taken  may  be  skipped. 

We  now  analyse  the  algorithm  total  run  time.  First  we  count  the  total  number  of  EX 
changes  required  for  all  elements  residing  at  some  processor  PE;  (from  the  above  discussion 
this  number  is  the  same  for  all  processors).    There  are  P  — 1  possible  destinations,  each 


group 

PE  =  000 

PE=101 

1 

000 — 

101—- 

2 

001-— 

100-— 

3 

010-— 

111 — 

4 

Oil — 

no — 

5 

100— 

001 — 

6 

101— 

000 — 

7 

1  lo- 

oil— 

8 

ll  1-— 

010-— 

Fig.  3.   Group  Ordering  for  Two  Processors. 
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determined  by  \ogP  bits.  For  simplicity  we  add  PE;  itself  as  a  destination.  At  average  for 
each  destination  half  the  address  bits  are  different  from  address  bits  of  PE;'.  Hence  the  total 
number  of  EX  steps  is  L/P*P*\ogP/2.  The  above  average  also  implies  that  half  of  the  ct"^ 
steps  are  followed  by  an  EX  step,  so  finally  we  get  a  total  of  (L*logP/2)*3  steps  for  the 
completion  of  the  algorithm. 

5.2.   Transpose  Columns  to  Rows  Tcr 

The  2-D  transpose  from  columns  to  rows  can  be  used  to  scatter  local  results  of  every 
processor  to  all  other  processors.  In  this  context  the  transpose  operation  serves  as  2-D  basic 
algorithm. 

We  use  the  same  method  as  for  Trc  with  the  following  differences: 

..,  =    i*L  +  i   _    , 
J  p  I 

Recall  that  ;  is  the  processor  number  and  /  is  the  row  number.  /  is  the  reminder  <j ,i>/P , 
thus  ;■'  is  the  logP  rightmost  bits  of  <j,i>. 

6.   Packing  and  Unpacking 

6.1.   One-Dimensional  Packing 

Consider  data  elements  dispersed  along  a  row  of  processors.  The  elements  appear  in 
order  with  the  processors  ids,  at  most  one  per  processor.  The  1-D  packing  operation  moves 
the  elements  to  consecutive  processors,  starting  at  PEO,  preserving  the  original  elements 
order.  Schwartz  [1]  and  Nassimi  &  Sahni  [7]  describe  in  general  an  algorithm  for  packing  in 
Shuffle  networks.  We  describe  here  in  details  a  similar  algorithm  to  perform  packing  in 
41ogP  steps,  half  of  them  for  an  initializing  PP  operation. 

In  order  to  perform  packing  of  elements,  the  destination  node  of  each  element  must  be 
determined  first.  This  is  done  in  2\ogP  steps  by  the  parallel  prefix  algorithm.  Then  the 
algorithm  proceeds  with  \ogP  iterations.  At  each  2-steps  iteration  each  element  advances 
"one  bit"  along  the  shortest  path  to  his  destination,  given  that  all  elements  correct  address  bit 
in  the  same  manner  (left  to  right  or  right  to  left). 

We  say  that  a  collision  occurs  if  after  some  step  a  processor  contains  more  than  a  single 
element.  Clearly,  this  can  happen  only  as  a  result  of  an  EX  step,  assuming  no  collision  at 
the  previous  step.    Thus  when  no  collision  occur,  the  algorithm  run  time  is  4]ogP . 

Denote  by  dest(A)  -  the  destination  processor  of  an  element  A,  orig(A)  -  the  processor 
where  A  originated,  loci(A)  -  the  location  of  A  at  the  end  of  the  i'th  iteration.  We  say  two 
elements  are  adjacent  when  they  are  located  in  two  (adjacent)  processors,  having  an  EX  con- 
necting them. 

Consider  two  adjacent  elements,  A  and  B ,  just  prior  to  the  EX  move  of  the  /+  I'th  itera- 
tion.   Note  that   loci(A)  is  different  from  lociiB)  at  the  rightmost  bit  only.    In  order  to  get  a 
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contradiction,  suppose  at  the  EX  move  of- the  /+  Tth  iteration  A  stays  in  place  but  B  joins  A's 
processor. 

•  loCiiA)  is  identical  to  loci(B)  in  logP-/-l  bits,  starting  at  the  second  rightmost  bit. 
These  are  the  \ogP-i-l  leftmost  bits  of  origiA)  and  orig{B).  Thus 
\orig{A)- orig{B)\  <  2''^^  so  at  most  2''^^-2  elements  could  have  resided  between  A 
and  B  initially.   This  implies  \dest{A)-dest{B)\  <  2'"^^ 

•  On  the  other  hand,  loCi(,A)  is  also  identical  to  loCi(B)  in  the  i  leftmost  bits.  If  A  is  adja- 
cent to  B  at  the  beginning  of  the  i  +  l  iteration  and  B  takes  EX  move  but  A  does  not, 
then  clearly  /oc,  +  i(A)  is  identical  to  /oc,  +  i(5)  in  the  /  +  1  rightmost  bits.  These  are  also 
the  i+l  rightmost  bits  of  destiA)  and  dest{B),  so  \dest{A)-dest{B)\  ^  2'"^^  which  is  a 
contradiction. 

Note  that,  using  same  number  of  steps,  the  procedure  can  pack  the  elements  starting 
from  any  processor,  given  that  the  order  is  preserved.   The  correctness  proof  still  holds. 

The  unpacking  is  the  inverse  operation  of  packing,  in  which  each  element  C  has  to 
arrive  to  his  destination  node  d{C).  d  is  a  permutation  which  preserves  order,  that  is 
A<B  =  =  >  d(A)<d(B).  Unpacking  is  achieved  in  21ogP  steps  using  reverse  paths  of  the 
corresponding  packing.  Note  that  whenever  ct~^  is  used  in  the  packing  algorithm,  the 
unpacking  algorithm  uses  the  cr  connections  and  vice  versa. 

6.2.    2-D  Packing 

Suppose  some  of  the  entries  within  the  2-D  array  are  empty.  The  2-D  packing  opera- 
tion is  defined  similar  to  the  1-D  packing  described  above.  The  2-D  array  arrangement  is  in 
row  order  at  any  time  (see  also  section  5).  We  know  of  no  correct  algorithm  for  solving  this 
problem. 

Here  an  0(L*logP)  2-D  packing  algorithm  is  described.  The  algorithm  is  confronted 
with  an  0(L*logP)  lower  bound  for  the  operation,  hence  no  2-D  efficient  packing  algorithm 
exists. 

The  2-D  packing  algorithm: 

(1)  Perform  2-D  PP  in  row  order  to  get  the  destinations. 

(2)  Compute  the  column  destination  and  the  row  destination  of  each  element. 

(3)  Using  2-D  PP,  compare  the  destination  row  of  each  non  null  element  with  the  destina- 
tion row  of  the  previous  one  in  that  row.  Whenever  the  row  destinations  differ,  assign 
this  element  with  ch  =  ch.previous+l,  otherwise  ch  =  ch. previous.  The  first  non  null 
element  in  every  row  always  gets  ch  =  \  (see  fig  6).  PE(P-l)  also  stores  a  column  of 
the  highest  ch  (0,  1  or  2)  of  every  row. 

(4)  Perform  2-D  column  broadcast  of  the  "highest  ch"  column  of  PE(P-l).  (Note  that 
symmetry  of  the  network  allows  us  to  refer  to  PE(P-  1)  as  PEO  and  vice  versa). 
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(5)  Denote  by  c/i,-  the  highest  ch  of  the  f'th  row.  We  perform  L  rounds,  one  for  each  row. 
At  the  /'th  round  we  perform  c/i,-  1-D  packings,  where  c/z,  is  at  most  2.  Note  that  the 
PP  phase  of  the  1-D  packing  is  superfluous  and  thus  may  be  omitted.  An  element  of  the 
I'th  row  participates  in  the  ;"th  pack  of  the  f'th  round  if  and  only  if  its  ch=j.  This 
takes  at  most  4*L*logP  steps. 

The  total  run  time  of  the  algorithm  is  4*L*logP  +  o(L*log/')  steps. 

6.3.   2-D  Unpacking 

The  2-D  unpacking  is  the  inverse  operation  of  2-D  packing.  Each  element 
C  =  <col,row>  is  destined  to  d(C),  where  ^  is  a  permutation  which  preserves  the  2-D  row 
order,  i.e.  A<5    =  =  >   d(A)^diB). 

An  0(L*logP)  2-D  unpacking  algorithm  consists  of  the  following  steps: 

(1)  &(2) 

These  steps  are  similar  to  steps  (3)  &  (4)  of  the  2-D  packing  algorithm,  correspond- 
ingly. 

(3)  This  step  is  similar  to  step  (5)  of  the  2-D  packing  algorithm,  except  replace  1-D  packing 
with  1-D  unpacking.    Note  that  the  highest  ch  here  can  exceed  2. 

We  claim  that  the  algorithm  run  time  is  0(L*logP),  same  as  packing.  For  row  ; 
/=1  •  •  •  L:  let  Ai  denote  the  number  of  different  destination  rows,  and  X,-  the  number  of 
destination   rows   that   do   not   appear   in   any   previous   source   row.    All  X,-   are   mutually 

exclusive.    Since  there  are  no  more  than  L  destination  rows,  ^X,    ^   L  .    Since  each  row 

1  =  1 

can  have  at  most  one  destination  row  common  to  the  previous  row.  A,-  <  X,  +  1.    Summing 
Ai's  over  all  rows: 


Steps 

PEO 

PEl 

PE2 

PE3 

PE4 

PE5 

PE6 

PE7 

row  destination 

2 

2 

2 

2 

2 

2 

3 

changes  check 

1 

1 

1 

1 

2 

row  destination 

2 

2 

2 

changes  check 

1 

2 

2 

2 

row  destination 

1 

1 

1 

■-  — -. — ■-■  - 

1 

changes  check 

1 

- 

1 

1 

1 

Fig.  4.    2-D  Packing:    An  Example  of  Step  (3). 
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t  Ai   ^    i  (^i  +  1)    ^    2L 
j=l  1=1 

Thus,  step  (3)  takes  at  most  2L  iterations  of  1-D  unpacking,  so  the  algorithm  total  runtime  is 
4*L*logP  +  o(L*logP). 

6.4,   2-D  packing  lower  bound 

We  call  two  processors /ar  when  at  least  logP-3  edges  traversals  are  necessary  to  reach 
one  from  the  other.  It  is  an  easy  exercise  to  show  that  any  arbitrary  processor  has  more 
than  P/4  of  the  other  processors  far  from  it.  It  is  also  not  hard  to  construct  a  2-D  packing 
permutation  where  every  processor  sends  an  element  to  all  other  processors,  e.g.  the  2-D 
array  where  an  entry  is  empty  if  and  only  if  it  is  a  diagonal  entry.  In  such  a  case  a  total  of 
at  least  P*/'/4*logP  edge  traversals  are  necessary  for  the  completion  of  the  packing  opera- 
tion. Hence  the  operation  terminates  after  at  least  P/4*logP  steps,  since  only  P  such  edge 
traversals  are  possible  at  a  single  step.  Thus  the  lower  bound  follows  for  certain  values  of 
L. 

7.   The  Smoothing  Operation 

This  operation  moves  elements  among  columns,  to  make  the  height  of  each  column  the 
same.  It  can  serve  as  a  load-balance  operation.  Denote  by  L^ax  the  length  of  the  highest 
column,  Lmin  the  length  of  the  lowest  one.  Denote  by  Dmax  the  biggest  difference  of  all 
column    size   differences   at   odd-even   pairs   (i.e.   pairs   connected   by   an   EX).     Note  that 

Smoothing  is  a  special  case  of  packing,  so  it  can  be  completed  in  OiL*\ogP).  However, 
since  the  preservation  of  order  is  not  necessary,  we  hope  to  do  better.  We  first  present  a 
rather  streightforward  algorithm. 

7.1.   Simple  Smoothing  Algorithm 

The  algorithm  incorporates  (at  most)  logP  iterations.   At  the  /'th  iteration: 

(i)  Find  Dmax,  broadcast  to  all  processors.  Use  EX  connections  to  move  elements  such  that 
every  pair  of  odd-even  processors  has  an  equal  size  column.   This  takes  Dniax/2  steps. 

(ii)  Find  Lmin  and  Lmax  and  broadcast  these  to  all  processors.  If  Lniin  =  ^ma.\  then  stop. 
Each  processor  now  recomputes  its  column  size  to  be  the  actual  current  value  minus 
^min-  ^raax  is  updated  accordingly.  In  L^ax  steps  each  processor  uses  the  cr~^  connec- 
tion to  send  his  column.  This  is  completed  in  at  most  Z-max  steps. 

The  correctness  of  the  algorithm  is  due  to  the  fact  that  the  a~^  connections  "unshuffles" 
2  consecutive  odd-even  pairs  onto  new  2  odd-even  pairs.  Hence  the  number  of  equal  height 
columns  doubles  at  each  iteration. 
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Suppose  only  one  column  is  of  size  L  and  the  rest  are  zero's.  Then  after 
~  2*L  +  A*log^P  steps  the  size  of  each  column  is  LIP .  At  the  other  extreme  case,  let  the 
left  half  of  the  processors  have  columns  of  size  L  and  the  right  half  has  columns  of  size 
zero.  It  is  easy  to  see  that  at  every  phase  there  exist  columns  of  size  L  and  columns  of  size 
zero,  so  in  this  case  the  algorithm  takes  ~L*logP  +  4*log^P  steps. 

This  algorithm  is  subject  to  several  "small"  improvements.  However  it  has  a  funda- 
mental characteristic  which  makes  it  slow:  it  is  nonadaptive  in  the  sense  that  elements  move 
in  one  direction  only  on  ct~^  edges,  full  columns  are  being  moved  in  cases  where  this  may 
be  superfluous  and  local  information  only  is  taken  into  account. 

7.2.   The  Support  Algorithm 

The  following  "Support"  algorithm  seems  to  be  free  of  the  simple  smoothing  algorithm 
flaws.  We  had  no  success  in  proving  its  optimality,  so  we  present  simulation  results  (see 
section  7.3).  All  the  cases  that  we  checked  were  smoothed  by  the  Support  algorithm  (up  to 
3  times)  faster  than  by  the  simple  algorithm.  For  any  reasonable  scope  of  P ,  Support  proves 
to  be  very  efficient.  The  general  question  wether  a  2D-efficient  smoothing  algorithm  exists, 
remains  open. 

Let  Lj^y  denote  the  final  column  height.  The  algorithm  consists  of  logL^v  identical 
phases,  each  involving  logP  rounds.  After  the  first  phase  every  processor  has  at  least  half 
(Lav/2)  of  its  final  set  of  elements.  In  every  subsequent  phase,  the  number  of  elements 
missing  at  the  lowest  column  processor  is  decreased  by  half.  Moreover,  the  hight  of  the 
highest  column  is  also  decreased  by  half  at  the  termination  of  each  phase,  so  in  terms  of  the 
highest  column  size,  the  Support  algorithm  performance  is  twice  the  performance  of  the 
first  phase.    We  describe  in  details  the  first  phase. 

At  the  beginning  of  the  round,  the  EX  connections  are  used  for  equalizing  the  sizes  of 
"pairs"  of  odd-even  columns.  Now  each  column  is  split  into  two  halves.  We  call  one  collec- 
tion of  all  halves  of  all  columns  Hi  and  the  other  H2-   Note  that 

\Hi\    =    \H2\  (1) 

and  that  is  how  it  is  going  to  stay  throughout  the  phase.    Note  also  that,  for  both  sets 

the  total  number  of  elements  residing  at  odd  processors. bris  equal  that  of  even  processori.'^) 

Although  the  actual  set  of  elements  of  Hj  change  during  the  run  of  algorithm,  the  notion  of 
which  elements  belong  to  H\  at  each  step  and  which  to  Hi  will  be  obvious. 

Let  /'  denote  logP  — /.  Let  5,-  be  the  collection  of  all  sequences  of  2''  consecutive  pro- 
cessors.   By  a  "sequence"  of  processors  of  length  2™,  we  always  mean  that  counting  starts  at 

one  of  the  processors  0,2'" P-l"".    The  phase  advances  in  rounds:    at  the  end  of  the 

/'th  round  Hi(2)  =  Hi„od2  is  evenly  distributed  among  the  elements  of  Si.    In  other  words,  if 
L/,,(2)     is     current    number     of    elements     in     PEl     belonging    to     //,(2),     then     for    all 
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it  =  0,2''. 


.P-X 


2    (  ^k+t,i(2)  ~  Lav/2  )    =    0 
/=o 


(3) 


Clearly  after  ~logP  rounds  smoothing  of  one  of  the  H  sets  is  completed. 

During  the  /'th  round  we  use  elements  of  //,-i(2)  to  "support"  //i(2)  such  that  (3)  holds. 
Note  that  (3)  implies  (1).  Then  //,(2)  uses  the  EX  connections  so  that  (2)  holds.  Thus,  at  the 
end  of  the  /'th  round,  both  (3)  and  (2)  hold  for  //j(2)-  This  implies  a  strengthened  versions 
of  (3):   for  all  k  =  0,2'' P-2'' 


Z     (  Lk+2i,i(2)  ~  Lav/2  )    =    0     , 


(4) 


and 


2l       (  ^ik  +  2/+l,,(2)   ~   LaV' 


1  =  0 


,/2)    =    0 


(5) 


Finally  observe  that,  using  ct"^  connections,  a  subsequence  of  odd  (or  even)  processors  of 
some  sequence  in  5,  is  connected  to  a  single  sequence  of  Si+i.  Also  every  sequence  in  S,+i 
is  "covered"  by  such  subsequence. 

Figure  5  describes  the  first  2  rounds  of  this  process.  "*"  denotes  all  processors,  "1*0" 
denotes  all  processors  starting  with  "1"  and  ending  with  "0",  i.e.  all  even  processors  at  the 
left  side.   The  "US"  (ct"^)  edges  demonstrate  the  support  operations  between  Hi  and  H2-   At 


EX/        \EX 

1*0  (H2)  1*1  (H2) 

US 
01*  (HI)  11*  (HI) 


Fig.  5:  the  Support  algorithm 
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the  beginning  of  round  i  +  l  this  property  is  used  so  //,(2)  "supports"  Hi+i(2)  such  that  (3) 
holds,  and  so  on  and  so  forth. 

We  now  describe  how  ^,(2).  obeying  (5)  and  (4)  at  the  beginning  of  the  f+l'th  round, 
"supports"  Hi+i(2)  so  that  (3)  holds.  Consider  a  forest  which  is  all  subtrees  of  hight  i'  of  a 
complete  tree  of  hight  logP.  Each  subtree  calculates  the  difference  between  the  total 
number  of  elements  of  //,  +  i(2)  in  a  sequence  of  5,  +  i  to  its  corresponding  total  number  of  ele- 
ments of  Hi(2)  in  the  corresponding  odd  (or  even)  subsequence  of  5,-.  This  difference  deter- 
mines how  many  elements  should  //,(2)  move  to  Hi+i(2)  (or  vice  versa)  through  the  cr"^  con- 
nections. Going  "back",  down  the  tree,  each  father  splits  the  work  of  moving  elements 
between  its  children  and  according  to  their  partial  sums.  This  procedure  of  "control"  takes 
~6/'  steps.  The  total  time  spent  in  the  control  procedure  throughout  the  run  of  the  algo- 
rithm is  0(log2p*logLmax)- 

Our  last  observation  is  that  the  control  procedure  may  run  independent  of  moving  the 
elements.  It  is  actually  performing  a  simulation  of  the  algorithm  itself,  given  the  initial 
number  of  elements  at  each  processor  as  input  and  giving  the  total  flow  and  direction  of 
flow  at  every  edge  as  output.  Thus,  to  improve  the  total  run  time,  we  first  perform  the  con- 
trol procedure,  saving  the  data  about  the  movement  of  elements  on  every  edge.  Then,  for 
each  edge,  a  number  which  is  the  total  sum  of  elements  traversing  it  is  calculated,  where  two 
opposite  traversals  cancel  each  other.  Only  then  the  flow  of  elements  starts:  a  processor 
sends  an  element  using  an  out-edge  if  (and  only  if)  it  has  an  element  and  the  edge  "control 
number"  is  positive. 

7.3.   Simulation  Results 

We  know  the  control  procedure  takes  0(log^P*logLraax)  steps.  The  question  remained 
is:  how  fast  is  the  flow  part?  The  flow  is  carried  out  in  an  optimal  pipeline,  what  about  the 
involved  paths?  To  gain  more  information  we  ran  simulations  of  the  algorithm,  measuring 
the  time  of  the  flow  part  only.  Early  results  showed  that  the  worst  initial  distributions  con- 
sist of  ~P/2  columns  of  height  L  =  Lniax  and  the  rest  of  height  0.  Since  we  are  interested  in 
worst  case  analysis,  these  are  the  initial  distributions  we  have  chosen,  where  the  location  of 
the  full/empty  columns  is  randomly  chosen. 

The  scope  of  P ,  achieveable  by  our  Sun-3  workstation,  is  limited  to  2^^.  However,  this 
range  turned  out  to  be  sufficient  in  getting  a  clear  tendency  of  the  results.  Even  if  this  ten- 
dency does  not  hold  for  all  P,  it  does  ,  probably,  for  up  to  P  =  2'^  at  least.  We  tried  2  sizes 
of  L:  1024  and  4096.  To  normalize,  the  total  number  of  steps  was  devided  by  L.  For  each 
value  of  logP  =  8,  ...  ,17  the  average  of  10  experiments  is  presented.  Most  deviations  from 
the  averages  presented  were  small:    the  highest  of  them,  0.55,  was  a  real  exception. 

The  results  suggest  a  linear  growth  of  the  flow  part  run-time  and  logP,  so  the  algorithm 
behaves  asymptotically  as  L*logP.  Note,  however,  that  the  constants  make  the  algorithm 
competitive  with  a  5*L  algorithm  at  the  feasible  range  of  P  (taking  the  control  part  into 
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account).  Moreover,  when  the  number  of  columns  is  much  less  or  much  more  then  Pll  or 
when  it  is  not  a  zero/full-column  distribution  then  the  performance  is  a  lot  better.  The  simu- 
lation results  are  presented  in  figure  6,  where  T  is  the  number  of  steps  divided  by  L  the 
coulmn  hights.  One  can  see  the  difference  between  the  simple-algorithm  and  the  help  algo- 
rithm. 

7.4.   The  Dup  Operation 

The  2-D  dup  operation  duplicates  each  element,  e,  producing  extra  copies  according  to 
a  tag  number  (e.c)  associated  with  e.   The  resulting  total  number  of  elements  is  D  =   y,  ^-C- 

air? 

Note  that  there  is  no  requirement  on  the  final  distribution  of  the  copies.    Let  D,-  =      V     e.c 

e  inPEi 

and  Li  the  number  of  elements  in  PEf   before  duplication. 

For  the  1-D  case  Schwartz  [1]  suggested  to  use  PP  to  determine  the  place  of  each  ele- 
ment in  the  new  array,  then  use  sorting  to  move  each  element  to  its  place,  and  finally  PP  by 
groups  to  duplicate  each  element  in  the  space  between  every  two  original  elements. 

For  the  2-D  case  smoothing  is  used  to  equally  distribute  the  duplication  work  among 
the  processors.  Then  D/P  more  steps  are  required  to  complete  the  duplication  at  each  pro- 
cessor locally.  Note  that  in  the  course  of  the  smoothing  operation  it  may  be  necessary  to 
duplicate  e  with  a  tag  e.c  to  ex  and  ^2  and  assign  them  with  new  tags  ei.c'  and 
e2-{e-c  —  e\.c').  This  duplication  might  cause  the  size  of  (real)  columns,  L,-,  to  increase 
which  might  increase  the  smoothing  runtime  too.  However,  in  the  next  subsection  a  way  is 
shown  how  to  equalize  D,-  and  Dj  for  two  adjacent  columns,  where  L,  and  Lj  increase  by  at 
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Fig.  6:    Smoothing  simulation  results. 
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most  1.  Hence  the  real  column  size  increases  by  at  most  log?  so  the  total  dup  algorithm 
runtime  is  0(Smooth(L)  +  D/P),  where  L  is  the  size  of  the  highest  column  before  duplica- 
tion and  Smooth(L)  is  the  cost  of  smooth  (see  section  7.5).  When  D/P>L,  this  algorithm  is 
better  than  the  minimum  of  0(D/P*logP  +  log^P)  steps  required  by  adopting  the  1-D  algo- 
rithm to  the  2-D  case. 

7.5.   Dup  -  Smoothing 

Given  two  adjacent  processors,  PEj  and  PE;',  it  remains  to  show  that  their  "D-heights" 
can  be  equalized  such  that  their  "L -heights"  increase  by  at  most  1.  It  is  easier  to  see  for  the 
simpler  smoothing  algorithm  (section  7.1).  Before  the  beginning  of  the  algorithm  each 
column  is  sorted  by  tags.  The  columns  are  kept  sorted  from  this  point  on,  throughout  the 
algorithm.  Consider  step  (i)  where  all  odd-even  pairs  of  processors  equalize  their  column 
D-heights.    Let  PE/,  PEi  +  1  be  such  a  pair. 

•  While  the  L-heights  are  different,  move  elements  from  the  higher  column  to  the  lower 
one,  preserving  the  internal  column  order  sorted  by  tags. 

•  For  Di  >  D,  +  i,  denote  Top(i)  the  element  with  the  largest  tag  in  PEz's  column  and 
Bottom{i  +  \)  the  element  with  smallest  tag  in  PEf  +  l's  column.  Obviously 
Topii)  >  Bottom(i  +  \).  Preserving  the  internal  column  order  sorted,  PEz  and  PE/4-1 
exchange  these  two  elements.  Keep  exchanging  until  D,  +  i  becomes  larger  then  or 
equal  to  D,-.  If  D.  +  i  >  D,-,  then  the  last  element  moved  from  PE/  to  PE/-H1  can  be 
split  into  two,  such  that  the  return  of  one  of  them  to  PE/  makes  the  D-hights  equal. 

For  the  new  step  (i)  of  the  simple  smoothing  algorithm,  the  final  L-heights  are  at  most  the 
average  of  the  L-heights  of  initial  columns  plus  one.  Moreover,  the  preservation  of  order  is 
easily  kept,  using  a  merge-like  scan  of  the  columns,  in  0(L)  steps.  The  overall  algorithm 
runtime  is  0(L*logL  -I-  L*logP  +  log^P),  regardless  of  the  D-heights. 

8.    Cartesian  Product 

Assume  that  the  elements  in  the  2-D  array  are  divided  into  two  groups:  A  and  B.  Then 
the  cartesian  product  (CP)  A  x  S  operation  produces  all  pairs  <a,b>,  such  that  a  belongs 
to  A  and  b  belongs  to  B .   There  is  no  restriction  of  the  final  distribution  of  pairs. 

In  [1]  there  is  an  algorithm  for  1-D  CP  for  groups  of  size  at  most  P^.    A  naive  adapta- 

\a  I*  \r  I 
tion  of  this  algorithm  to  the  2-D  case  requires  at  least  0(-' — W — '-*logP)  steps.    Suppose 


[a  |>  |5  I  and  \A  \>P ,  and  assume  A  is  smoothed,  i.e.  each  processor  has  -^^  of  A's  el 


P 


ements. 


The  following  algorithm  achieve  2-D  CP  of  A  and  B  in  0(  ''^l*'^l    +  |5|)  steps: 

(i)     Use  2-D  broadcast  to  produce  a  copy  of  each  element  of  B  in  each  processor.    This 
takes  at  most  2*(|S|  +  logP)  steps. 
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(ii)    Locally,  each  processor  creates   <a,b>   for  every  a   which  belongs  to  A's  elements 

\a  I*  Is  I 

stored  in  it  and  every  b  in  B.   This  takes  -^ — y — ^  steps. 

8.1.   Cartesian  Product  by  Groups 

Given  list  of  CPs  CPBG  =  {AiXBi,  ■  ■  ■  Ar^^r  }.  the  cartesian  product  by  group 
(CPBG)  operation  produces  all  pairs  of  all  the  products  in  CPBG.  The  following  assump- 
tions are  natural  for  many  cases: 

•  r  is  "sufficiently"  small,  i.e.  r«P  and  r<<L. 

•  The  CPBG  is  initially  known  for  all  processors,  including  group  sizes.  Here  we  also 
assume  that  CPBG  is  given  such  that  all  processors  discriminate  groups  appearing  at  the 
"S-side"  from  those  appearing  at  the  "A-side".  The  CPBG  is  stored  in  the  following 
data  structure:  for  every  "A-group"  Aj  there  is  an  ordered  list  Lj  containing  all  "B- 
groups"  that  are  "taken  product"  with  Aj  in  CPBG,  i.e.  G  is  in  Lj  if  and  only  if  Ajy-G 
belongs  to  CPBG. 

•  The  groups  are  mutually  disjoint,  i.e.  elements  of  different  groups  may  have  same 
value  but  always  different  "id"s. 

The  following  CPBG  algorithm  is  a  close  variant  of  the  CP  algorithm: 

(i)  All  groups  at  the  "S-side"  are  broadcasted  to  all  PEs,  not  repeating  same  group- 
broadcast  twice.  Note  that  this  enables  an  ordering  of  the  elements  of  each  "S-group" 
consecutively,  which  is  equivalent  at  all  processors. 

(ii)  Let  e  be  an  element  of  a  group  Aj  where  Aj  is  at  the  "A-side".  Let  \Lj\  denote  the  total 
number  of  elements  of  all  groups  in  Lj.  e  is  associated  with  a  label  containing  its  group 
id  and  a  range:  e.Aj.<\  ■  ■  ■  \Lj\> .  The  range  of  e  corresponds  to  the  (ordered)  ele- 
ments from  the  (ordered)  groups  of  Lj  with  which  it  is  to  be  paired. 

(iii)  The  labeled  elements  are  "smoothed"  by  a  variant  of  the  smoothing  incorporated  in  the 
dup  operation  (section  7.5).  Whenever  the  smooth  operation  requires  a  spliting  of 
e.Aj.<x  ■  ■  ■  y>  into  ei  and  62,  the  range  of  e  is  split  into  disjoint  "segments": 
e\.Aj.<x  ■  ■  ■  z>  and  e2-Aj.<z+\  ■  ■  ■  y>.  The  result  of  this  step  is  an  even  distribu- 
tion of  the  "pairing"  work  among  all  processors. 

(iv)  Each  processor  generates  all  paires  of  "A-elements"  with  their  "5-partners",  specified 
by  their  labels. 

r  |5-|*1A| 

The  total  number  of  steps  required  is  at  most  ~  LA*\ogP  +   V  (  |S,|  +  _         )  where  L^ 

1  =  1  ^ 

is  the  highest  column  at  the  "A-side". 
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9.   Discussion  and  Summary 

The  following  table  summarizes  performance  of  2-D  efficient  algorithms  presented  in 
this  work.    Only  dominating  terms  are  given. 


Operation 


Perfect  Shuffle  Time 


Raw-Reduction 


4(Z.  +  logP) 


Multiple-Broadcast 


5*L*P 


Parallel-Prefix 


20(L  +  logP) 


Transpose  Rows  to  Columns        l.5*L*logP 


1-D  Packing 


4*logP      (2*logP  for  Unpacking) 


2-D  Packing  and  Unpacking         4*L*logP 


Smoothing 


L*logP      (no  matching  lowerbound) 


Dup 


0(L*logP +D/P) 


Cartesian  Product  of  A  and  B        ''^y^l   +  5*  |S  | 


Cartesian  Product  by  Groups 


L*logP  +  2(|5il  +  i^i^M) 


We  gave  several  2-D  basic  algorithms,  several  of  them  where  2-D  efficient.  In  some 
cases  it  was  not  possible  for  us  to  do  better  then  0(L*logP).  It  turned  out  that  the  2-D  effi- 
cient algorithms,  such  as  row  reduction  or  parallel  prefix,  are  those  related  to  problems 
which  do  not  involve  permutations.  On  the  other  hand,  problems  for  which  the  best  algo- 
rithm is  provably  inefficient,  such  as  the  packing  or  transpose  operations,  clearly  are  special 
cases  of  2-D  permutations.  Note  that  smoothing  is  not  a  spesific  2-D  permutation  but  rather 
can  be  considered  as  a  collection  of  such,  from  which  one  has  to  be  chosen.  The  question 
remained  open  wether  smoothing  can  be  done  2-D  efficiently. 
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An  interesting  model  to  compare  with  is  the  Layered  Shuffle.  This  model  incorporates 
logP  layers  of  P  swithches,  each  connected  in  a  shuffle  pattern  to  the  next  level.  The  last 
and  first  levels  are  connected  to  the  PS  network,  so  the  leveled  structure  is  cyclic.  Switches 
are  simple:  a  "limited"  routing  capability  and  one  buffer  of  size  O(logP)  bits.  On  such  a 
machine  the  cases  of  2-D  permutations  considered,  packing,  unpacking,  Trc  and  Tcr,  become 
2-D  efficient.  0(L  +  logP)  algorithms  follow  from  the  algorithms  presented  above.  Smooth- 
ing is  efficient  as  a  special  case  of  packing. 

The  set  of  algorithms  that  we  have  chosen  to  focus  on,  are  such  that  efficiently  support 
data  manipulation  on  a  SIMD  machine.  Such  algorithms  can  be  used  as  a  basic  operations 
on  such  a  machine.  The  fact  that  several  of  these  algorithms  are  provably  optimal,  can 
motivate  two  other  lines  of  research:  one  is  to  implement  a  SIMD  machine  with  these  algo- 
rithms as  instructions  of  some  programming  language.  The  second  is  to  expand  the  MIMD 
PRAM  machine  to  support  SIMD  mode,  and  to  switch  between  these  modes. 
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